Transliteration Mining Using Large Training and Test Sets
Authors
Abstract
Much previous work on Transliteration Mining (TM) was conducted on short parallel snippets using limited training data, and successful methods tended to favor recall. For such methods, increasing the training data may hurt precision, and applying them to large comparable texts may hurt both precision and recall. We adapt a state-of-the-art TM technique with the best reported scores on the ACL 2010 NEWS workshop dataset, namely graph reinforcement, to work with large training sets. The method models observed character mappings between a language pair as a bipartite graph, and unseen mappings are induced using random walks. Increasing the training data yields more correct initial mappings, but the induced mappings become more error prone. We introduce a parameterized exponential penalty into the formulation of graph reinforcement and estimate the proper parameters for training sets of varying sizes. The new formulation led to sizable improvements in precision. Mining from large comparable texts leads to the presence of phonetically similar words in the source and target texts that may not be transliterations or may adversely affect candidate ranking. To overcome this, we extracted related segments with high translation overlap and then performed TM on them. Segment extraction produced significantly higher precision for three different TM methods.
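The graph-reinforcement idea described above can be illustrated with a toy sketch: observed character mappings form a bipartite graph, unseen mappings are induced via two-hop random walks, and an exponential penalty damps the induced scores. All data, names, and the exact penalty form here are hypothetical illustrations, not the paper's actual formulation:

```python
from collections import defaultdict

# Hypothetical seed counts of observed source->target character mappings
# (toy data for illustration only).
seed_counts = {
    ("p", "پ"): 90,
    ("p", "ب"): 10,
    ("b", "ب"): 100,
}

def mapping_probs(counts):
    """Normalize counts into P(target_char | source_char)."""
    totals = defaultdict(float)
    for (s, _), c in counts.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}

def reinforce(probs, alpha=2.0):
    """One round of graph reinforcement on the bipartite mapping graph.

    A source char s reaches an unseen target char t2 via a two-hop walk
    s -> t -> s2 -> t2.  The induced score is damped by raising the walk
    product to a penalty exponent alpha (a hypothetical penalty form),
    so noisier inductions contribute less as training data grows.
    """
    # Reverse edges: weight of s2 given t, taken here as P(t | s2)
    # (proportional to the true backward probability in this sketch).
    rev = defaultdict(list)
    for (s, t), p in probs.items():
        rev[t].append((s, p))

    induced = dict(probs)
    for (s, t), p_st in probs.items():
        for s2, p_back in rev[t]:
            if s2 == s:
                continue
            for (s3, t2), p_new in probs.items():
                if s3 != s2 or (s, t2) in probs:
                    continue  # only induce genuinely unseen mappings
                walk = p_st * p_back * p_new
                induced[(s, t2)] = max(induced.get((s, t2), 0.0),
                                       walk ** alpha)
    return induced
```

With the toy data, the unseen pair ("b", "پ") is induced through the walk b → ب → p → پ, and a larger alpha shrinks its score relative to the observed mappings, mimicking how the penalty curbs error-prone inductions on large training sets.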
Similar Papers
QCRI-MES Submission at WMT13: Using Transliteration Mining to Improve Statistical Machine Translation
This paper describes QCRI-MES’s submission on the English-Russian dataset to the Eighth Workshop on Statistical Machine Translation. We generate improved word alignment of the training data by incorporating an unsupervised transliteration mining module into GIZA++ and build a phrase-based machine translation system. For tuning, we use a variation of PRO which provides better weights by optimizing...
Transliteration Mining with Phonetic Conflation and Iterative Training
This paper presents transliteration mining on the ACL 2010 NEWS workshop shared transliteration mining task data. Transliteration mining was done using a generative transliteration model applied to the source language words, with its output constrained to the words in the target language. A total of 30 runs were performed on 5 language pairs, with 6 runs for each language pair. In the presence of...
PR + RQ ≈ PQ: Transliteration Mining Using Bridge Language
We address the problem of mining name transliterations from comparable corpora in languages P and Q in the following resource-poor scenario:
• Parallel names in PQ are not available for training.
• Parallel names in PR and RQ are available for training.
We propose a novel solution for the problem by computing a common geometric feature space for P, Q and R where name transliterations are mapped t...
Statistical models for unsupervised, semi-supervised and supervised transliteration mining
We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e. noise). The model is trained on noisy unlabelled da...
Transliteration System Using Pair HMM with Weighted FSTs
This paper presents a transliteration system based on pair Hidden Markov Model (pair HMM) training and Weighted Finite State Transducer (WFST) techniques. Parameters used by WFSTs for transliteration generation are learned from a pair HMM. Parameters from pair-HMM training on English-Russian data sets are found to give better transliteration quality than parameters trained for WFSTs for corresp...